ABSTRACT

In this work, we present our participation in the microblog track
in TREC-2014, building upon our first participation last year. We
present our approaches for the two tasks of this year:
temporally-anchored ad-hoc search and tweet timeline generation.
For the ad-hoc search task, we used topical expansion in addition
to temporal models to perform retrieval. Our results show that our
run based on the typical pseudo relevance feedback query expansion
outperformed all of our other runs, with a mean average precision
(MAP) of 0.5155. As for the timeline generation task, we approached
this problem using online incremental clustering of tweets
retrieved for a given query. Our approach allows the dynamic
creation of “semantic” clusters while providing a framework for
detecting redundant tweets and selecting representative ones to be
added to the final timeline. The results demonstrate that using
incremental clustering of tweets retrieved through a temporal
retrieval model produced the best effectiveness among the
submitted runs.

1.  INTRODUCTION 

Microblogging services such as Twitter are attracting users looking
to engage in vibrant and influential hubs for information sharing
and finding. With hundreds of millions of tweets posted daily, a
large number of queries are issued seeking information. Recent
studies on Twitter data have emphasized the high temporality of
information published through Twitter, mostly covering breaking
news and events [8, 23]. This temporality is also reflected in
searching behavior over tweets [23], making it essential for a
microblog search system to consider this characteristic of the data
and the task. In addition, the very short length of queries (e.g.,
an average of 3.76 words in this year’s microblog ad-hoc search
task at TREC-2014) and tweets (with a maximum length of 140
characters) makes searching for tweets a challenging task.

Due to these factors, a microblog search system should consider
temporal signals in tweets and queries in addition to augmenting
their context to improve retrieval. In this work, we study the
effectiveness of retrieval given these two main factors:
temporality and context. We specifically study ad-hoc search given
three types of retrieval models: (1) a purely temporal model, (2) a
query expansion model, and (3) a model that combines both temporal
and query expansion factors to perform search.

Given the huge number of tweets that can be retrieved using a
query, presenting a long list of tweets to a user on a given
information need may no longer be practical; the number of tweets
the user has to go through about a topic can be overwhelming [20].
Minimizing tweet redundancy and irrelevancy can help provide a user
with a more informative and compact list of tweets on a topic of
interest. Continuous clustering algorithms are among the most
commonly used methods to bring summarized tweet timelines to a user
[15, 20]. In such approaches, online clustering, usually supported
by near-duplicate detection, is used to extract representative
tweets from a large stream of tweets on an ongoing topic. We employ
these ideas to design a tweet timeline generation system that
accepts a temporally-anchored query and provides the user with a
timeline of non-redundant, chronologically-ordered tweets posted
before the query time.

The remainder of this paper is organized as follows. We discuss our
approach to the temporally-anchored ad-hoc search task in addition
to the evaluation results in Section 2. Section 3 describes how we
tackled the problem of tweet timeline generation (TTG) along with
the evaluation results. We conclude this paper with Section 4.

2.  AD-HOC  SEARCH 

The temporally-anchored ad-hoc search task is one of the microblog
track tasks at TREC that has run since 2011 [18, 21, 11]. Given a
free-text query issued at a given time, this task aims to retrieve
timely relevant tweets for that query. To perform this task, we
leverage retrieval models based on two main intuitions. First, due
to the temporality of the task and the data, temporal retrieval
models might be effective in this task, as demonstrated in previous
studies [3, 5, 12, 4]. Second, the very short length of tweets and
queries can impede effective retrieval, which has motivated the use
of context expansion methods in microblog ad-hoc search [2, 5, 24,
16]. In total, we work with three retrieval models described next.

2.1  Retrieval  Models 

2.1.1  Query  Likelihood  (QL)

All  of  the  models  we  use  in  this  work  benefit  from  the 
Query  Likelihood  (QL)  model  [19]  in  retrieval.  This  model 
ranks documents by the likelihood that their language models
generated the query as follows:

$$P(D|Q) \propto P(Q|D)\,P(D) \tag{1}$$

where $D$ is a document and $Q$ is the query. Assuming a uniform
document prior $P(D)$ and term independence, documents can be
ranked by

$$P(D|Q) \propto P(Q|D) = \prod_{w \in Q} P(w|D) \tag{2}$$

More specifically, we use the log-likelihoods to rank documents by

$$\sum_{w \in Q} \log P(w|D) \tag{3}$$

where $P(w|D)$ is computed using maximum likelihood estimate (MLE)
with Dirichlet smoothing [25] as follows:

$$P(w|D) = \frac{tf_{w,D} + \mu P(w|C)}{|D| + \mu} \tag{4}$$

where $tf_{w,D}$ is the term frequency of $w$ in $D$, $P(w|C)$ is
estimated using MLE over the collection $C$, and the smoothing
factor $\mu$ is a free parameter for this retrieval model.
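
As an illustration, the log-likelihood ranking with Dirichlet
smoothing can be sketched in Python as follows. This is a minimal
sketch under stated assumptions, not the implementation behind the
track API: the names `ql_log_score`, `collection_tf` and
`collection_len` are hypothetical, and skipping query terms that
never occur in the collection is an assumption made here to avoid
taking the logarithm of zero.

```python
import math
from collections import Counter


def ql_log_score(query_terms, doc_terms, collection_tf, collection_len, mu=1000.0):
    """Sum of log P(w|D) over the query terms, with Dirichlet smoothing."""
    tf = Counter(doc_terms)          # term frequencies in the document
    doc_len = len(doc_terms)         # |D|
    score = 0.0
    for w in query_terms:
        # P(w|C): maximum likelihood estimate over the whole collection.
        p_wc = collection_tf.get(w, 0) / collection_len
        if p_wc == 0.0:
            continue  # assumption: ignore terms unseen in the collection
        # Dirichlet-smoothed document language model.
        p_wd = (tf[w] + mu * p_wc) / (doc_len + mu)
        score += math.log(p_wd)
    return score
```

With a toy collection in which "a" and "b" each occur twice, a
document containing the query term receives a higher score than one
that does not, e.g. `ql_log_score(["a"], ["a", "a"], {"a": 2, "b": 2}, 4)`
exceeds `ql_log_score(["a"], ["b", "b"], {"a": 2, "b": 2}, 4)`.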

2.1.2  Time-based  Exponential  Priors  (t-EXP)

The  t-EXP  model  [10]  is  a  temporal  variation  of  the  QL 
model  in  which  an  exponential  decay  factor  is  used  as  a 
document  prior  as  follows: 

$$P(D|Q) \propto P(Q|D) \cdot r \cdot e^{-r \cdot t_d} \tag{5}$$

where $r$ is a decay rate factor, and $t_d$ is the posting time
difference in days between $D$ and $Q$. As with the QL model, we
rank documents using log-likelihoods by

$$\sum_{w \in Q} \log P(w|D) + \log\left(r \cdot e^{-r \cdot t_d}\right) \tag{6}$$
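
The log-space form of the exponential prior reduces to
$\log r - r \cdot t_d$, which can be added directly to a QL
log-likelihood. The following Python sketch shows this under the
assumption that query and tweet times are given as epoch seconds;
the function names are hypothetical, and the default $r = 0.05$ is
the value used in our runs.

```python
import math

SECONDS_PER_DAY = 86400.0


def t_exp_log_prior(query_time, tweet_time, r=0.05):
    """log(r * exp(-r * t_d)), where t_d is the age of the tweet in days."""
    t_d = (query_time - tweet_time) / SECONDS_PER_DAY
    return math.log(r) - r * t_d


def t_exp_score(ql_log_likelihood, query_time, tweet_time, r=0.05):
    """QL log-likelihood plus the log of the exponential document prior."""
    return ql_log_likelihood + t_exp_log_prior(query_time, tweet_time, r)
```

For two tweets with the same QL log-likelihood, the one posted
closer to the query time receives the higher score.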

2.1.3  Time-based  Query  Relevance  Modeling 
(t-QRM) 

t-QRM [7] is a variant of the typical query relevance modeling
approach [9] that uses a temporal query relevance model computed as
follows:

$$P(w|Q) = \sum_{t \in T} P(w|t,Q)\,P(t|Q) \tag{7}$$

where $t$ is a timestamp in units of days and $T$ is the set of
timestamps in the collection. Given an initial list $R_k$ retrieved
using the QL model, we estimate $P(t|Q)$ as the normalized sum of
retrieval scores of documents posted within $t$. The probability
$P(w|t,Q)$ can be computed as follows:

$$P(w|t,Q) = \sum_{D \in t} P(w|D)\,P(D|t,Q) \tag{8}$$

$P(D|t,Q)$ is assumed to be uniform over all documents in $R_k$
posted within $t$. $P(w|D)$ is computed using the MLE. Once
$P(w|Q)$ is computed for all terms in $R_k$, we expand the query
with the $m$ terms with the highest probability. Given the expanded
query, the final results are retrieved using the typical QL model.
Both the initial list size $k$ and the number of expansion terms
$m$ are free parameters for this model.
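
The estimation of the temporal relevance model and the selection of
expansion terms can be sketched in Python as follows. This is a
sketch under stated assumptions rather than the exact code used for
our runs: the input schema (a list of `(terms, day, score)` tuples
for $R_k$) and the function name are hypothetical, and the retrieval
scores are assumed to be positive so that their normalized sums form
a valid distribution $P(t|Q)$.

```python
from collections import Counter, defaultdict


def t_qrm_expansion_terms(results, m=5):
    """Return the m terms with the highest P(w|Q) and their probabilities.

    results: list of (terms, day, score) tuples for the initial list R_k.
    """
    results = [r for r in results if r[0]]   # drop empty documents
    by_day = defaultdict(list)
    for terms, day, score in results:
        by_day[day].append((terms, score))
    total = sum(score for _, _, score in results)

    p_w_q = defaultdict(float)
    for day, docs in by_day.items():
        p_t_q = sum(s for _, s in docs) / total   # P(t|Q)
        p_d_t = 1.0 / len(docs)                   # uniform P(D|t,Q)
        for terms, _ in docs:
            for w, count in Counter(terms).items():
                p_w_d = count / len(terms)        # MLE of P(w|D)
                p_w_q[w] += p_w_d * p_d_t * p_t_q
    # Sort by decreasing probability; ties are broken alphabetically.
    ranked = sorted(p_w_q.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:m]
```

Because each factor is a proper distribution, the values of
$P(w|Q)$ over all terms in $R_k$ sum to one.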


2.1.4  PRF-based  Query  Expansion  (QE) 

Earlier work on microblog search showed that query expansion with
Pseudo Relevance Feedback (PRF) [9] has good effectiveness in this
task [14, 2, 17]. In typical PRF-based retrieval, a query is
expanded using the $m$ top-scoring terms extracted from an
initially retrieved list $R_k$ given $Q$. In this work, we used a
tf-idf-like [13] scoring function to score terms over all documents
in $R_k$ as follows:

$$Score(w, R_k) = tf_{w,R_k} \cdot idf(w) \tag{9}$$

$tf_{w,R_k}$ is the term frequency of $w$ in $R_k$ and $idf(w)$ is
the inverse document frequency of $w$ computed as
$idf(w) = \log(\ldots)$. Once we expand the query with $m$ terms, we
use the QL model to retrieve the final list of documents using the
expanded query. Both $m$ and $k$ are free parameters for this
model.
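
A minimal Python sketch of this term-scoring step is given below.
The exact idf expression is not spelled out above, so the sketch
assumes the common form $\log(N / df_w)$, where $N$ is the number of
documents in the collection and $df_w$ is the document frequency of
$w$; this form, the function name, and the `doc_freq` and `num_docs`
arguments are assumptions for illustration only.

```python
import math
from collections import Counter


def prf_expansion_terms(feedback_docs, doc_freq, num_docs, m=25):
    """Return the m top-scoring terms from the feedback documents R_k.

    feedback_docs: list of token lists (the top-k initially retrieved tweets).
    doc_freq: mapping from term to its collection document frequency.
    """
    # tf_{w,R_k}: total occurrences of w across all feedback documents.
    tf = Counter(w for doc in feedback_docs for w in doc)
    scores = {}
    for w, count in tf.items():
        df = doc_freq.get(w, 0)
        if df > 0:
            # Assumed idf form: log(N / df_w).
            scores[w] = count * math.log(num_docs / df)
    # Sort by decreasing score; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:m]
```

Under this assumed idf, terms that are frequent in the feedback list
but rare in the collection rank first, while very common terms
receive a score close to zero.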

2.2  Evaluation  Setup 

Similar to last year, the track used the Tweets13 collection of
approximately 243 million tweets [11]. Participants can access and
retrieve tweets from this collection by submitting a query to the
track-provided API¹ [11]. Given the query, the API returns a list
of tweets using the QL retrieval model from the Tweets13
collection. Participants can then use their own retrieval model to
process this list and produce a final one.

Evaluation of the 2014 ad-hoc search task is performed given a list
of 55 new topics released with Tweets13. We submitted four official
runs based on three retrieval models (discussed in Section 2.1):
PRF-based query expansion, t-EXP and t-QRM. We tuned the parameters
of these retrieval models using 60 topics released with the
microblog track in TREC-2013 with the Tweets13 collection [11]. We
also removed retweets and non-English tweets from result lists; the
language of a tweet is detected using an open-source language
detection tool². We evaluated retrieval using precision at rank 30
(P@30) and mean average precision (MAP), which were the primary
evaluation measures used in previous runs of this task [18, 21,
11].

2.3  Experimental  Results 

We present each of our four officially submitted runs in Table 1.
Whenever the QL model is used, we set $\mu = 1000$.³ We present
results on retrieval effectiveness of these runs in Table 2. We
also compare the performance of our official runs to two baselines:

•  Baseline14: a run based on the underlying retrieval model of the
common API, i.e., Lucene’s implementation of the query likelihood
model with Dirichlet smoothing [11].

•  Median: the median retrieval results of all automatic runs
submitted to this task.

Table 1: Description of our ad-hoc search official runs

Run                 Model     Parameters
QUQueryExp5D25T     PRF-QE    k = 5, m = 25
QUTmpDecay          t-EXP     r = 0.05
QUQueryExp10D15T    PRF-QE    k = 10, m = 15
QUTQRM              t-QRM     k = 25, m = 5

¹ https://github.com/lintool/twitter-tools/wiki/TREC-2013-API-Specifications

² https://code.google.com/p/language-detection/

³ We tried different values for this parameter and found that this
value produces the best results over Tweets13.


Table 2: MAP and P@30 of each run. * and/or ° denotes a significant
difference from Baseline14 and/or Median, respectively. Best value
per measure is boldfaced.

Run                 MAP         P@30
Baseline14          0.4250      0.6461
QUQueryExp5D25T     0.5155*°    0.6697°
QUTmpDecay          0.4337      0.6473
QUQueryExp10D15T    0.4932*°    0.6436
QUTQRM              0.4704°     0.6267
Median              0.4155      0.6261

We notice that our runs had better MAP compared to both baselines.
However, only the two runs based on PRF-based query expansion had
significantly higher MAP than Baseline14. Moreover, these two runs
along with QUTQRM had significantly better effectiveness compared
to the median run. Interestingly, we see that only one run had
better P@30 than Baseline14. Overall, 3 out of 4 runs had a
slightly higher P@30 than the median, and as with MAP, the
improvement was significant with the QUQueryExp5D25T run.

The results showed that the non-temporal run QUQueryExp5D25T had
the best performance on both measures compared to other runs and
baselines. This shows that the typical and rather simple PRF-based
QE is an effective retrieval approach for microblog ad-hoc search.

As for temporal models, we see that the run QUTmpDecay has almost
the same performance as Baseline14. This is not surprising since
the retrieval model t-EXP is based on the QL model but uses a
temporal decay factor as a document prior. This might indicate that
using such a prior did not help in capturing the temporal aspect of
the data and the task. The QUTQRM run had almost the same P@30
compared to Baseline14 but it notably improved MAP, suggesting it
helped improve the overall ranking of tweets, but not necessarily
the top 30 ones. To understand the behavior of such models in
relation to the given queries, analysis of the temporal nature of
queries is needed.

3.  TWEET  TIMELINE  GENERATION 

Timeline generation is a new task introduced this year at the
microblog track. It aims at producing a timeline of non-redundant,
chronologically-ordered tweets that are relevant to a given query
issued at time $q_t$. The timeline basically constitutes a summary
for a topic (e.g., an event) represented by the given query. The
definition of the task inherently imposes the need for an initial
list of “potentially-relevant” tweets, which indicates that the new
task is highly dependent on the quality of the retrieval result
list $R_q$ (and thus the retrieval model used to retrieve those
results).


3.1  Online  Clustering  Approach 

This year, we adopted a simple online-clustering technique [1] to
detect sub-events that are not redundant before producing the final
timeline for a given query. The rationale behind this technique is
that we need to detect such clusters without determining their
number in advance. In online-clustering, the data to be clustered
is processed in a stream, where the incoming data can either be
added to an existing cluster or form a new cluster, thus having a
dynamic set of clusters. The approach pipeline illustrated in
Fig. 1 is outlined as follows:

1.  Ad-hoc Retrieval: Given the query $q$ at time $q_t$, a ranked
list of 1000 tweets is retrieved using model m.

2.  Duplicate Removal: Duplicates (or near-duplicates) of tweets
are removed from the retrieval results by normalizing the tweets
(i.e., removing stop words, URLs, and mentions) and then hashing
the normalized tweets [22].

3.  Tweet Streaming: Only the top k tweets are considered for
timeline generation after removing the near-duplicates. Retrieval
results are ordered by some criterion (e.g., chronologically, or
based on retrieval scores) to form a stream of k tweets. The
algorithm then processes the stream, one tweet at a time.

4.  Clustering: We then construct clusters that represent
“sub-topics” by processing the tweet stream. Initially, there are
no clusters. Each incoming tweet is either added to an existing
cluster if it exhibits a high similarity to it, or forms a new
cluster if none of the existing ones is similar. Similarity between
a tweet and a cluster can be measured in different ways, e.g.,
similarity between the tweet and the cluster’s centroid.

5.  Cluster Filtering: Singleton clusters (i.e., clusters of only
one tweet) can optionally be filtered out (i.e., not represented in
the timeline) as they might be outliers.

6.  Tweet Selection: After clustering all tweets in the stream,
each cluster elects one or more tweets to represent it in the final
timeline. There are several ways to select such tweets, e.g., the
tweet that is most similar to the centroid.
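
The duplicate removal step (step 2) can be illustrated with the
following Python sketch. It is only a sketch under stated
assumptions: the stop-word list is a tiny illustrative set, the
tokenization rule and the choice of SHA-256 as the hash function are
assumptions made here, and the function names are hypothetical. The
first occurrence of each normalized tweet in the ranked list is
kept.

```python
import hashlib
import re

# Illustrative stop-word list only; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "rt"}


def normalize(text):
    """Lowercase, strip URLs and mentions, and drop stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"@\w+", " ", text)
    tokens = re.findall(r"[a-z0-9#']+", text)
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def remove_duplicates(ranked_tweets):
    """Keep only the first tweet for each distinct normalized form."""
    seen, kept = set(), []
    for tweet in ranked_tweets:
        digest = hashlib.sha256(normalize(tweet).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(tweet)
    return kept
```

Two tweets that differ only in their URL, mentions or stop words
normalize to the same string and therefore hash to the same value,
so only the higher-ranked one survives.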

3.2  Baseline 

Since the retrieved tweets that appear in the tweet stream above
can (by definition) represent a timeline, we used that list as a
baseline approach to which we compare our online clustering
approach.

3.3  Submitted  Runs 

Prior to TREC, the track organizers shared with the participants a
small training set based on 10 queries from the Tweets11
collection. We conducted preliminary experiments using that set
with different configurations and parameters for each of the steps
of the proposed approach. We eventually chose the 4 runs described
in Table 3.

Two submitted runs (indicated by the BL postfix) are based on the
baseline approach using two different retrieval models. The other
two (indicated by the CL postfix) used the online clustering
approach with two other retrieval models.


Figure  1:  Pipeline  of  our  TTG  approach. 


Table 3: Description of our official TTG runs

Run                 Model     Parameters
QUQEd5t25TTgBL      PRF-QE    k = 5, m = 25
QUTqrmTTgBL         t-QRM     k = 25, m = 5
QUTmpDecayTTgCL     t-EXP     r = 0.05
QUQEd10t15TTgCL     PRF-QE    k = 10, m = 15

In  all  of  the  submitted  runs: 

•  The top 75 tweets were selected after duplicate detection to
form the tweet stream.

•  The most similar tweet to the query in a cluster (i.e., the one
with the highest relevance score, which might change over the
course of stream processing) acted as its centroid. Hence, the
similarity between a cluster and an incoming tweet was measured by
the similarity between the incoming tweet and the most relevant
tweet in the cluster. That tweet was eventually selected as the
representative of the cluster in the timeline.

•  Singleton clusters were not filtered out.

•  Cosine similarity was used as the similarity function. A
similarity threshold of 0.6 was used to guide clustering decisions.
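
Under these settings, the clustering and selection steps can be
sketched in Python as follows. This is a minimal sketch, not the
exact code behind the submitted runs: the tweet schema (dictionaries
with `id`, `terms`, `score` and `time` keys) and the function names
are hypothetical, raw term-frequency vectors are assumed for the
cosine similarity, and assigning a tweet to the most similar cluster
when several exceed the threshold is an assumption made here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def generate_timeline(stream, threshold=0.6):
    """Cluster the tweet stream online and return one tweet per cluster."""
    clusters = []  # each cluster: {"rep": most relevant tweet, "members": [...]}
    for tweet in stream:
        vec = Counter(tweet["terms"])
        best, best_sim = None, 0.0
        for cluster in clusters:
            # The most relevant tweet of a cluster acts as its centroid.
            sim = cosine(vec, Counter(cluster["rep"]["terms"]))
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(tweet)
            if tweet["score"] > best["rep"]["score"]:
                best["rep"] = tweet  # the centroid may change over time
        else:
            clusters.append({"rep": tweet, "members": [tweet]})
    # Singleton clusters are kept; the timeline is ordered chronologically.
    return sorted((c["rep"] for c in clusters), key=lambda t: t["time"])
```

Because each cluster is compared through a single representative
tweet, the cost per incoming tweet is linear in the current number
of clusters, which is small for a stream of 75 tweets.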

3.4  Evaluation  Setup 

The same 55 queries used in the ad-hoc search task were also used
in the TTG task. The submitted runs in the ad-hoc task constituted
the judgment pool for TTG as well. An additional round of manual
judgments was performed on the tweets that were judged as relevant
to each query to form semantic clusters containing redundant
tweets.

System effectiveness is measured using cluster precision and two
versions of cluster recall. Cluster precision $P$ is defined as the
percentage of distinct semantic clusters that are represented in
the generated timeline out of the tweets in that timeline. The
unweighted version of cluster recall $R_u$ is defined as the
percentage of distinct semantic clusters that are represented in
the generated timeline out of the judged semantic clusters. The
weighted version $R_w$ weights the semantic clusters based on the
aggregate relevance levels of the tweets included in each cluster.⁴
Two versions of F1 are then used as the figure of merit, $F1_u$ and
$F1_w$.


⁴ https://github.com/lintool/twitter-tools/wiki/TREC-2014-Track-Guidelines


3.5  Experimental  Results 

Table 4: Evaluation (un-weighted) results of TTG submitted runs.
Best precision and recall values are italicized and best F1 value
is boldfaced.

Run                 P         R_u       F1_u
QUQEd5t25TTgBL      0.2436    0.3795    0.2967
QUTqrmTTgBL         0.2366    0.3727    0.2894
QUQEd10t15TTgCL     0.3049    0.3277    0.3159
QUTmpDecayTTgCL     0.3236    0.3277    0.3256

Table 5: Evaluation (weighted) results of TTG submitted runs. Best
precision and recall values are italicized and best F1 value is
boldfaced.

Run                 P         R_w       F1_w
QUQEd5t25TTgBL      0.2436    0.5660    0.3406
QUTqrmTTgBL         0.2377    0.5637    0.3333
QUQEd10t15TTgCL     0.3049    0.5316    0.3875
QUTmpDecayTTgCL     0.3236    0.5167    0.3980

Tables 4 and 5 show the performance of our submitted TTG runs in
the measures described earlier. P and R indicate the average
precision and recall, respectively, over all queries. F1 is
computed from the average precision and the corresponding average
recall, not as an average F1 over all queries.
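
As a small worked check, the reported F1 values can be reproduced
from the averaged precision and recall using the standard harmonic
mean. The helper name `f1` in the Python sketch below is
hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# QUTmpDecayTTgCL: average P and R taken from Tables 4 and 5.
print(round(f1(0.3236, 0.3277), 4))  # unweighted F1
print(round(f1(0.3236, 0.5167), 4))  # weighted F1
```

The two printed values, 0.3256 and 0.398, agree with the F1 entries
reported for this run in Tables 4 and 5.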

The results show that, while the baseline approach had better
recall (as it maximizes the number of represented clusters), the
online clustering approach exhibited better precision (as it avoids
redundant tweets/clusters) and thus better F1 values. Moreover, the
exponential-decay-based model had better F1 values than the
PRF-based QE model. More experiments and analysis of the results
are needed to explain the reason behind that. We also notice that
the F1 values are relatively low, which shows either the difficulty
of the problem or the opportunity for improvements.

No median results per query (across participants) were shared by
the track organizers; however, F1 results of all anonymous
submitted runs from all participants (about 45 runs) were shared
and are illustrated in Figures 2 and 3 for the unweighted and
weighted versions, respectively. In both figures, F1 values of QU
runs are circled and the best of them is marked with the
corresponding precision and recall values. In both cases, the best
QU run was ranked among the top 10 (or probably 11) submitted runs,
while all of them were ranked better than the median submitted run.
This indicates the potential of the online clustering approach for
the tweet timeline generation problem.


Figure  2:  Performance  of  QU  runs  relative  to  other 
submitted  runs  (unweighted  measures)  in  the  TTG 
task. 


Figure  3:  Performance  of  QU  runs  relative  to  other 
submitted  runs  (weighted  measures)  in  the  TTG 
task. 


4.  CONCLUSION 

Continuing from our participation in the track last year [6], we
again turned to context expansion-based retrieval models to perform
ad-hoc search. We used two query expansion retrieval models: one is
the typical PRF-based model and the other uses temporal aspects of
the query in selecting expansion terms. Furthermore, we retrieved
tweets using a temporal model that was found effective in this
context. The results showed the superiority of PRF-based query
expansion retrieval over all other retrieval models we used.

In our work on the TTG task, we employed the same retrieval models
used in ad-hoc search to retrieve tweets for a given query. Online
clustering of tweets with the help of near-duplicate detection was
then used to produce a timeline for a given query. The results
showed that clustering of tweets retrieved through the temporal
exponential-decay retrieval model (t-EXP) had the best
effectiveness compared to our other TTG runs. Based on the F1
measure, this run was also ranked among the top 10 runs submitted
to this task.